Is Web Scraping Legal?

How to evaluate the legality of web scraping for your research

Agenda

  1. Motivation and disclaimer
  2. Overview of legal situation
  3. Step-by-step plan to evaluate legality (or, to improve it)

1. Motivation & Disclaimer

Image: by pikisuperstar on Freepik

Motivation

  • In the past, getting data for research was challenging and expensive, but digital technology and the internet have made a vast amount of data easily accessible online.
  • This data offers real-time insights into various societal processes, relationships, and interactions, enabling researchers to answer research questions more accurately and efficiently[1].
  • Web scraping (also called text and data mining (TDM) by legal scholars) is a data collection method that allows for the automated extraction of insights and information from large amounts of text or data resources[2].
  • The information acquired through TDM can be used to address various scientific and societal challenges, such as tracking the spread of diseases like COVID-19[3].

Researchers require web scraping to generate data about the digital economy.

Disclaimer

  • I’m not a lawyer (but this deck has been duly compiled on the basis of legal resources)
  • The legal situation is still changing rapidly, and this deck may soon be outdated

Image: Investopedia / Joules Garcia

Database Rights

Database rights are a subset of copyright. A database is an organized collection of materials that allows users to search and access individual pieces of information.

Copyright law protects databases when the way the data is selected or arranged is original and creative. Scraping must therefore not result in copying and, for example, republishing the original database’s structure (or a substantial part of it)[11].

Non-original databases can also be protected if a significant investment was made in obtaining, verifying, and presenting the data[12].

When scraping a data source that may be subject to database rights, consider:

  • scraping only some data,
  • scraping only the data itself (without the replication of the organization of that data),
  • limiting the data scraped to factual or non-copyrighted data[6].
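
The three precautions above can be sketched in code. This is only an illustration, assuming scraped records arrive as dictionaries; the field names and the helper `select_factual_sample` are hypothetical, and whether a given field counts as "factual" remains a legal judgment that code cannot make.

```python
import random

# Hypothetical whitelist of fields treated as plain facts (an assumption).
FACTUAL_FIELDS = {"product_id", "price", "listed_on"}

def select_factual_sample(records, fraction=0.25, seed=0):
    """Keep a random fraction of the records and only whitelisted fields."""
    if not records:
        return []
    rng = random.Random(seed)
    # Scrape only some data: take a sample rather than the whole source.
    k = min(len(records), max(1, int(len(records) * fraction)))
    sample = rng.sample(records, k)
    # Keep the data itself, not the source's organization (category trees,
    # descriptions, layout), and only fields on the factual whitelist.
    return [{key: value for key, value in r.items() if key in FACTUAL_FIELDS}
            for r in sample]
```

Keeping the whitelist explicit makes it easy to document, and later justify, which fields were collected.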

The TDM Exception

  • The new “Digital Single Market” (DSM) directive on copyright permits scraping (reproduction and extraction) of data from databases for the purposes of text and data mining, even if the data is protected by copyright.
  • However, the exception is limited: database owners have the option to restrict the reproduction and extraction of databases and their content.
  • That restriction must be expressed in a way that bots and crawlers, among others, can detect, i.e. in machine-readable form (for example, a robots.txt file or similar metadata that tells visiting scraping programs that scraping is prohibited).
  • Therefore, unless you take a large amount of data/structure and later republish or sell it, there is a good chance you will not violate any intellectual property rights[11].
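
One common machine-readable signal is a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to check whether a path is disallowed. The helper name `may_fetch`, the user agent `research-bot`, and the example rules are assumptions for illustration. Whether a robots.txt rule amounts to a valid reservation of rights under the directive is a legal question this check does not answer.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # Parse the rules from text so the check works without network access.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Hypothetical rules, as a site owner might publish them.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(may_fetch(EXAMPLE_ROBOTS, "research-bot", "https://example.org/private/data"))
print(may_fetch(EXAMPLE_ROBOTS, "research-bot", "https://example.org/public/page"))
```

In a real project, the rules would be downloaded from the target site first, for instance with the parser's `set_url()` and `read()` methods.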

Terms of Service

A Terms of Service (ToS), Terms of Use, or Privacy Policy can be found on almost every website.

This raises the question: do we have to abide by a website’s Terms of Service?

Clickwrap vs Browsewrap

Clickwrap: ToS that you must explicitly agree to.

  • If you have to explicitly agree to the ToS in any way (such as by logging in, clicking ‘I agree’ or ‘OK’, or downloading the app), these are clickwrap ToS.
  • You are informed of the existence of the ToS, and you actively agree to them.
  • Courts have ruled that your explicit agreement creates a binding contract that you must follow.

Browsewrap: ToS that are buried on the site.

  • These ToS are usually accessible via a link at the bottom of a webpage.
  • They state that you agree to the terms simply by using or browsing the site.
  • Most courts have ruled that this type of ToS is unenforceable, so even if the terms forbid you from using the service, you may not be in violation of them[8].

Personal Data

Image: by vectorjuice on Freepik

What is Personal Data?

  • Personal data, or personally identifiable information (PII), refers to any data that can directly or indirectly identify an individual.
  • Common types of personal data include name, email, phone number, address, user name, IP address, date of birth, employment information, bank/credit card information, medical data, and biometric data.
  • Publicly available personal data is not exempt, so a legal analysis must be conducted before scraping it.
  • Different legal jurisdictions (US, EU, Canada, Australia, etc.) have different regulations for personal data, so you need to identify the jurisdiction in which the data owners reside[6].
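
A first technical check is to flag scraped text that appears to contain personal data. The sketch below covers only two of the types listed above (email addresses and IPv4 addresses) with deliberately simple regular expressions. The function name `flag_personal_data` is hypothetical, and a clean result does not show that the text is free of personal data.

```python
import re

# Simplified patterns (assumptions): they catch common forms, not every valid one.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def flag_personal_data(text):
    """Return the email and IPv4-like strings found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "ipv4_addresses": IPV4_RE.findall(text),
    }

sample = "Contact jane.doe@example.org from 192.0.2.10 about the order."
print(flag_personal_data(sample))
```

Names, addresses, and other identifiers need other techniques or manual review; this check only helps decide which scraped fields deserve a closer look.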

GDPR Compliance

  • The General Data Protection Regulation (GDPR) covers the European Economic Area (EEA).
  • Although it was drafted and passed by the European Union (EU), it imposes obligations on all organisations that target or gather information about people in the EU, wherever those organisations are based[14].
  • To use or store personal data of people in the EU, a company or an organisation must rely on at least one of the lawful bases described by the GDPR. You may, therefore, need either to:
    • obtain the consent of the person whose data you are going to process,
    • be performing a task carried out in the public interest, or
    • as a controller or even as a third party, have a legitimate interest in processing the data, where such processing is necessary to achieve that interest[11].

Lawful Reasons (GDPR)

  • Contract - an organisation has a contract with the data subject (person whose data we have), for example, a contract to supply goods or services or an employment contract.
  • Legal Obligation - an organisation is fulfilling a legal obligation, such as when data processing is required by law.
  • Public Task - to complete a public task, mostly relating to the tasks of public administrations such as schools, hospitals, and municipalities.
  • Vital Interest - when data processing is in the data subject’s vital interests, for example, when this might protect their life.
  • Legitimate Interest - for example, when a bank uses personal data to determine whether the data subject is eligible for a higher-interest savings account.

In all other cases, the company or organisation must obtain the data subject’s permission (known as “consent”) before collecting or reusing their personal information[15].

Type of Personal Data (GDPR)

In addition to establishing the lawful reason for scraping data, you should also consider the type of personal data being collected. Sensitive data is subject to additional rules and requires explicit consent before it can be scraped and stored.

Sensitive data includes:

  • racial or ethnic origins,
  • genetic data,
  • political opinions,
  • biometric data that can uniquely identify someone,
  • religious or philosophical beliefs,
  • health, sex life or sexual orientation data,
  • or trade union membership.

Therefore, you should avoid scraping this data unless you have explicit consent and a legitimate reason to do so[16].

Data Protection Principles (GDPR)

If you process data, you have to do so according to the data protection principles:

  • Lawfulness, fairness and transparency - Processing must be lawful, fair, and transparent to the data subject.
  • Purpose limitation - You must process data for the legitimate purposes specified explicitly to the data subject when you collected it.
  • Data minimization - You should collect and process only as much data as absolutely necessary for the purposes specified.
  • Accuracy - You must keep personal data accurate and up to date.
  • Storage limitation - You may only store personally identifying data for as long as necessary for the specified purpose.
  • Integrity and confidentiality - Processing must be done in such a way as to ensure appropriate security, integrity, and confidentiality (e.g. by using encryption).
  • Accountability - The data controller is responsible for being able to demonstrate GDPR compliance with all of these principles[14].
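
Data minimization and confidentiality can be partly supported in code. The sketch below keeps only the fields a hypothetical study needs and replaces the user name with a keyed hash. The field names, the `NEEDED_FIELDS` set, and the `minimize` helper are assumptions. Note that this is pseudonymization rather than anonymization: pseudonymized data is still personal data under the GDPR, and the key must be stored securely and separately.

```python
import hashlib
import hmac

# Hypothetical: the only fields the research question requires.
NEEDED_FIELDS = {"rating", "review_text"}

def minimize(record, secret_key: bytes):
    """Drop unneeded fields and replace the user name with a keyed pseudonym."""
    reduced = {k: v for k, v in record.items() if k in NEEDED_FIELDS}
    # HMAC-SHA256 gives a stable pseudonym that cannot be recomputed without the key.
    digest = hmac.new(secret_key, record["user_name"].encode("utf-8"), hashlib.sha256)
    reduced["user_pseudonym"] = digest.hexdigest()
    return reduced
```

The same key yields the same pseudonym for a given user name, so records from one user can still be linked for analysis without storing the name itself.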

Data Policies (GDPR)

Even when you have received explicit consent from the data subject, you need to ensure that the correct data retention and access policies are in place:

  • Ensure that data subjects are aware of the company’s data protection and privacy policy.
  • Comply with Data Subject Access Rights (DSAR), including the right to withdraw consent, request a copy of data, or request deletion of data.
  • If consent is withdrawn or a DSAR for deletion is received, delete or anonymize the personal data as it is no longer legally justifiable to retain it[16].
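
A retention and deletion policy of this kind might be applied to a stored dataset as follows. This is a sketch under stated assumptions: each record is a dictionary with hypothetical `subject_id` and `collected_at` keys, and the retention period is a project-specific choice that the GDPR does not prescribe.

```python
from datetime import datetime, timedelta, timezone

def apply_retention_policy(records, now, max_age_days, erased_subjects):
    """Keep only records that are within the retention period and whose
    data subject has not withdrawn consent or requested deletion."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        r for r in records
        if r["collected_at"] >= cutoff and r["subject_id"] not in erased_subjects
    ]

# Usage: run periodically and after every deletion request.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
stored = [
    {"subject_id": "a", "collected_at": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"subject_id": "b", "collected_at": datetime(2024, 1, 30, tzinfo=timezone.utc)},
    {"subject_id": "c", "collected_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
]
kept = apply_retention_policy(stored, now, max_age_days=30, erased_subjects={"b"})
```

Here only the record for subject "a" remains: "b" asked for deletion, and the record for "c" is older than the retention period. Backups and derived files would need the same treatment.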

Residential IPs (GDPR)

Residential proxies provide real IP addresses of actual devices. When using a residential IP for scraping (or even just accessing web pages), you appear to be accessing websites and social media platforms from an actual home-based IP[17].

  • IP addresses are considered personally identifiable information (PII) under the GDPR.
  • Ensure that any EU residential IPs used as proxies are GDPR compliant.
  • Obtain explicit consent from the owner of a residential IP before using it as a web scraping proxy.
  • If obtaining residential proxies from a third-party provider, ensure that they have obtained consent and are in compliance with GDPR before using the proxy[16].

3. Evaluate the Legality of Your Scraping Project

Image: Pixabay

Step 1: Territorial Scope

Start by determining which jurisdictions’ laws apply to your scraping project, because its legality must be assessed under each of them.

  • Check the jurisdiction of the website: Determine the location of the website and the jurisdiction under which it operates.
  • Check the jurisdiction of the data subject: If the website contains personal data, determine the jurisdiction of the data subject.
  • Check the jurisdiction of the scraper: Determine the location of the scraper and the jurisdiction under which it operates.
  • Consider cross-border data transfer restrictions: Some countries have restrictions on the transfer of personal data outside of their jurisdiction. Make sure that the data being scraped is being transferred in a legal manner.

Step 2: Personal Data

  • Is personal data involved?
  • If you collect or hold data of people in the EU, do you have a lawful basis for processing based on one of the following conditions?
    • consent of the data subject;
    • contract with the data subject;
    • necessary for compliance with a legal obligation;
    • necessary in order to protect the vital interests of the data subject or a third party;
    • necessary for the performance of a task in the public interest or in the exercise of official authority vested in the controller;
    • necessary for the purposes of the legitimate interests pursued by the controller or by a third party, except where such interests are overridden by the rights of the data subject.

Step 3: The Type and Use of Data

  • Do you really need personal data? You should always anonymize the personal data if there is an option to do so.
  • Is the data considered to be sensitive? Scraping sensitive data entails complying with additional rules and obtaining specific consent for its scraping and storage. Avoid scraping it without a legitimate reason and clear explicit consent.
  • What is the extent of the proposed data collection? An important aspect of GDPR is that companies should only collect and handle the minimum amount of data necessary to successfully perform a specific task.
  • How do you plan to use the data post-extraction? Under GDPR you need to have a clear and legal reason for scraping data and be able to demonstrate that it will be used for legitimate purposes.

Step 5: Terms and Conditions

  • Did you explicitly agree to the Terms of Service in any way? Review the terms and conditions to determine if data extraction would be in breach of these ToS.
  • Are you scraping data behind a login? Logging into a website to extract data can raise legal issues. Logging in typically requires accepting the website’s terms and conditions (T&C) which might explicitly state that automatic data extraction is prohibited.
  • Are you scraping data from a mobile app? By downloading the app, you agree to terms and conditions.
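
The questions in these steps can be recorded as a simple checklist so that open issues are documented before scraping starts. The sketch below is one possible encoding. The class `ScrapingProject`, its fields, and the function `open_questions` are hypothetical, it covers only the yes/no questions (not the territorial analysis of Step 1), and an empty result is not a finding that the project is legal.

```python
from dataclasses import dataclass

@dataclass
class ScrapingProject:
    """Answers to the yes/no questions from the step-by-step plan."""
    involves_personal_data: bool
    has_lawful_basis: bool
    involves_sensitive_data: bool
    has_explicit_consent: bool
    agreed_to_tos: bool
    tos_prohibits_scraping: bool

def open_questions(project):
    """Return the points that still need legal review, labeled by step."""
    issues = []
    if project.involves_personal_data and not project.has_lawful_basis:
        issues.append("Step 2: no lawful basis for processing personal data")
    if project.involves_sensitive_data and not project.has_explicit_consent:
        issues.append("Step 3: sensitive data without explicit consent")
    if project.agreed_to_tos and project.tos_prohibits_scraping:
        issues.append("Step 5: accepted ToS prohibit automated extraction")
    return issues
```

Storing the answers alongside the dataset also gives you a record of the assessment made at the time of collection.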

References

1. Krotov, V., Johnson, L., & Silva, L. (2020). Tutorial: Legality and Ethics of Web Scraping. Communications of the Association for Information Systems, 47, 555–581. https://doi.org/10.17705/1CAIS.04724
2. Springer Nature. (2018). Text and Data Mining at Springer Nature. https://www.springernature.com/gp/researchers/text-and-data-mining
3. Fiil-Flynn, S. M., Butler, B., Carroll, M., Cohen-Sasson, O., Craig, C., Guibault, L., Jaszi, P., Jütte, B. J., Katz, A., Quintais, J. P., Margoni, T., Souza, A. R. de, Sag, M., Samberg, R., Schirru, L., Senftleben, M., Tur-Sinai, O., & Contreras, J. L. (2022). Legal reform to enhance global text and data mining research. Science, 378(6623), 951–953. https://doi.org/10.1126/science.add6124
4. Hanson, E., & Kim, H. (2019). hiQ’s preliminary injunction affirmed. A green light for data scraping or not? In White & Case. https://www.whitecase.com/insight-our-thinking/hiqs-preliminary-injunction-affirmed-green-light-data-scraping-or-not
5. Bryan, K., Jacobson, J., Dull, K., & Tse, S. (2022). Federal Court Rules in Favor of LinkedIn’s Breach of Contract Claim after Six Years of CFAA Data Scraping Litigation. In Squire Patton Boggs. https://www.privacyworld.blog/2022/11/federal-court-rules-in-favor-of-linkedins-breach-of-contract-claim-after-six-years-of-cfaa-data-scraping-litigation/
6. Daruwalla, S. (2019). Solution Architecture: Conducting Web Scraping Legal Review. In Zyte. https://www.zyte.com/blog/solution-architecture-part-3-conducting-a-web-scraping-legal-review/
7. U.S. Copyright Office. (2016). What Does Copyright Protect? https://www.copyright.gov/help/faq/faq-protect.html
8. Daruwalla, S. (2021). Is Web & Data Scraping Legally Allowed? In Zyte. https://www.zyte.com/learn/is-web-scraping-legal/
9. U.S. Copyright Office. (2023). U.S. Copyright Office Fair Use Index. https://www.copyright.gov/fair-use/
10. Witzel Erb Backu & Partner Rechtsanwälte mbB. (2021). Snapshot: The scope of copyright in European Union. In Lexology. https://www.lexology.com/library/detail.aspx?g=1013cebf-9e8e-41a0-bac0-595f0e04133b
11. Szwed, P. (2021). Is web scraping legal? A short guide on scraping under EU law. In Discover Digital Law. https://discoverdigitallaw.com/is-web-scraping-legal-short-guide-on-scraping-under-the-eu-jurisdiction/
12. European Commission. (2022). Protection of databases. In Shaping Europe’s digital future. https://digital-strategy.ec.europa.eu/en/policies/protection-databases
13. Růžičková, L. (2022). Are website terms of use enforced? In Apify. https://blog.apify.com/enforceability-of-terms-of-use/
14. Wolford, B. (2018). What is GDPR, the EU’s new data protection law? In GDPR.eu. https://gdpr.eu/what-is-gdpr/
15. Your Europe. (2022). Data protection and online privacy. https://europa.eu/youreurope/citizens/consumers/internet-telecoms/data-protection-online-privacy/index_en.htm
16. Daruwalla, S. (2018). GDPR Compliance For Web Scrapers: The Step-by-step Guide. In Zyte. https://www.zyte.com/blog/web-scraping-gdpr-compliance-guide/
17. NetNut. (2020). Residential Proxies for Scraping Web Data. https://netnut.io/10-reasons-to-use-residential-proxies-for-scraping/